Prediction of treatment response after stereotactic radiosurgery of brain metastasis using deep learning and radiomics on longitudinal MRI data

We developed artificial intelligence models to predict the brain metastasis (BM) treatment response after stereotactic radiosurgery (SRS) using longitudinal magnetic resonance imaging (MRI) data and evaluated prediction accuracy changes according to the number of sequential MRI scans. We included four sequential MRI scans for 194 patients with BM and 369 target lesions for the Developmental dataset. The data were randomly split (8:2 ratio) for training and testing. For external validation, 172 MRI scans from 43 patients with BM and 62 target lesions were additionally enrolled. The maximum axial diameter (Dmax), radiomics, and deep learning (DL) models were generated for comparison. We evaluated the simple convolutional neural network (CNN) model and a gated recurrent unit (Conv-GRU)-based CNN model in the DL arm. The Conv-GRU model performed superior to the simple CNN models. For both datasets, the area under the curve (AUC) was significantly higher for the two-dimensional (2D) Conv-GRU model than for the 3D Conv-GRU, Dmax, and radiomics models. The accuracy of the 2D Conv-GRU model increased with the number of follow-up studies. In conclusion, using longitudinal MRI data, the 2D Conv-GRU model outperformed all other models in predicting the treatment response after SRS of BM.


Patient selection
We retrospectively reviewed the medical data of patients with BM between January 2015 and October 2020 for the Developmental dataset.The patients were selected based on the following inclusion criteria: (a) age > 19 years; (b) patients with proven underlying malignancy as a primary source; (c) patients diagnosed with BM with a high likelihood using brain MRI; (d) presence of a precise date of SRS for the BM; (e) underwent baseline MRI on the same date as SRS (pre-SRS); (f) underwent follow-up MRI at least three times after SRS with intervals of > 30 days (first to third post-SRS follow-up); (g) and were followed-up clinically and radiologically after the third follow-up MRI to assess the treatment response of SRS.The exclusion criteria were as follows: (a) history of brain surgery before SRS, (b) history of WBRT before SRS, (c) absent 3D post-contrast T1-weighted images with 1-mm slice thickness from pre-SRS or follow-up MRI, or (d) visible BM nodules < 5 mm on pre-SRS MR images.
For external validation, we additionally enrolled patients with BM between November 2020 and December 2022, adhering to the same inclusion and exclusion criteria established for the Developmental dataset.Given the temporal separation in MRI acquisition dates relative to the Developmental dataset, we designated this dataset as the Temporal test set.

MRI analysis
For the Developmental dataset, we included 194 patients with 369 target BM lesions from 776 MRI examinations (four MRI scans per patient, including one pre-SRS and three post-SRS MRI scans).The data were divided randomly into training and testing datasets in a ratio of 8:2.For the training set, we used 616 MRI scans from 154 patients.For the testing set, we used 160 MRI scans from 40 patients.For the Temporal test set, we included 43 patients with 62 target BM lesions from 172 MRI examinations.We defined measurable disease as a contrastenhancing lesion that could be measured accurately in at least one dimension with a minimum size of 5 mm (modified RANO-BM criteria).The size threshold of the modified RANO-BM criteria is smaller than that of the RANO-BM criteria (10 mm).This modification was suggested by the RANO-BM working group only in the setting of brain MRI with a slice thickness ≤ 1.5 mm 6 .Otherwise, the modified RANO-BM criteria follow the RANO-BM criteria 6 .The maximum diameter of each BM was measured on the representative axial plane.Two neuroradiologists (S.J.C. and L.S. with 9 and 12 years of experience in neuroradiology, respectively) assessed the ground truth for treatment response according to the modified RANO-BM criteria by consensus.Upon determining the ground truth, the reviewers had access to all clinical information and follow-up MRI scans after

Model development for comparison
We developed three arms to compare the performance of treatment response prediction: Dmax, radiomics, and DL.The common processes for arm development included pre-processing, BM segmentation, feature extraction, and sequential modelling.In sequential modelling, we employed machine learning algorithms, capable of capturing significant temporal patterns and feature importance without manual feature engineering.We performed end-to-end prediction modelling for the DL arm, sequential feature extraction and analysis modelling for the radiomics arm, and modelling for the Dmax arm (Fig. 1).
In the pre-processing step, each voxel's spacing and signal intensity on the MR image varied based on the scan parameters.Thus, we resampled the image to obtain a voxel spacing of 0.5 × 0.5 × 0.5 mm 3 .Subsequently, we normalised the image by resampling its signal, excluding the background, which ranged from -1 to 1 based on the signal intensity of the position manually selected in the grey matter.To extract the subregions containing a single BM, we cropped the image into 3D patches with a size of 64 × 96 × 96 voxels based on the BM segmentation labels provided by the mentioned neuroradiologists.
An NVIDIA GeForce GTX 1080 Ti graphics processing unit (NVIDIA, Santa Clara, CA, USA) was used for DL.Furthermore, DL training was conducted using Python 3.8.10 and the PyTorch 1.6.0framework in the Ubuntu 16.04.6operating system.We used the PyCharm (JetBrains s.r.o., Prague, Czech Republic) and Visual Studio Code (Microsoft Corp., Redmond, WA, USA) softwares.
Because we obtained four times of 3D volumes for each BM (pre-SRS and first to third post-SRS follow-ups), we extracted the image features of each volume initially and modelled the four features for treatment response prediction sequentially.For the feature extraction of each volume, we utilised a convolutional neural network (CNN) model with randomly initialised ResNet-34 as the backbone.As an independent comparison arm of the 3D CNN, we used a two-dimensional (2D) CNN for the image analysis, in which 2D patches were derived from three orthogonal slices of each 3D patch.
We used two sequential modelling methods suitable for high-dimensional feature analysis for the preliminary model selection in the DL arm using the four image features from the CNN, each consisting of a 512-dimensional vector.First, we concatenated the four features and applied a fully connected layer, taking a 2048-dimensional vector as its input (simple CNN).Second, we used a gated recurrent unit (Conv-GRU) (Fig. 2) 23 , a deep neural network specialised in sequential modelling of high-dimensional deep features.The two models were trained in an end-to-end manner.Each simple CNN and Conv-GRU model for 10 distinct dataset splits (training:test = 8:2) was trained for the statistical analysis.We selected the model with the best accuracy after applying statistical analysis between the simple CNN and Conv-GRU.In addition, we conducted an ablation study by replacing the CNN and GRU components with other backbones in the 2D Conv-GRU model.
We applied data augmentation that comprised random rotation from -30° to + 30°, random scaling from 0.85 × to 1.15 × , random horizontal flip with 0.5 probability, and random translation from − 10 px to + 10 px for each axis.We used the Adam optimiser and the focal loss function for the learning hyperparameters and set the epoch, batch size, and learning rate to 20, 8, and 1 × 10 −4 , respectively.For cases where the epoch ranged between 10 and 15, we set the learning rate at 1 × 10 −5 .For cases in which the epoch was > 15, the learning rate was set at 1 × 10 −6 to accommodate the higher epoch number.
To enhance the interpretability of the DL models, we conducted a post hoc analysis using a class activation map (CAM).This analysis highlighted the specific subregions of the input images from which the feature vector was extracted predominantly.Specifically, we utilised the Eigen-CAM algorithm, which improves the clarity of the CNN predictions by visualising the principal components of the learned representations from the convolutional layers 24 .For instance, a distinct activation pattern in an enhancing BM nodule boundary between PD and non-PD cases could reveal insights into the model's classification ability.
As we sequentially modelled the CNN-extracted features with GRU, the Dmax and radiomics features of four different time points were analysed by the XGBoost models 25 .Specifically, the radiomics features, which consisted of the first-order statistics, shape, grey level co-occurrence, run length, and size zone matrices, were extracted using the PyRadiomics library 26 .Four Dmax and radiomics features of each patient were concatenated sequentially to formulate the input of the XGBoost models: input vectors were constructed by sequentially concatenating features (e.g.feature1 of time 1 to feature1 of time 4, or Dmax of time 1 to Dmax of time 4).By using all the features from four distinct time points, these approaches allowed models to assess feature importance and select relevant features automatically.The code for the implemented models in this study can be found in: https:// github.com/w-cho/ mri_ convg ru.

Evaluation of model performance
First, we assessed the performance of each model in predicting binary treatment responses (PD versus non-PD).The prediction accuracies were obtained by training all four time points of the serial MRI scans (pre-SRS and first to third post-SRS follow-ups).
To identify the optimal number of follow-up MRI studies for predicting the treatment response after SRS, we evaluated the changes in prediction accuracy according to the number of serial (pre-SRS, pre-SRS to first post-SRS, pre-SRS to second post-SRS, and pre-SRS to third post-SRS) MRI scans for the best-performing model.We modified our Conv-GRU model architecture by increasing the number of sequential inputs from one to four while maintaining consistent experimental settings for dataset splitting, data augmentation, optimisation, loss function, epochs, batch size, and learning rate.The optimal hyperparameter for the focal loss was investigated for each fold using a grid search.www.nature.com/scientificreports/

Statistical analyses
The area under the curve (AUC), specificity, and sensitivity were assessed for each model.We derived the optimal cut-off values for the receiver-operating characteristic analysis from Youden's J statistic.Post hoc tests were performed using the Bonferroni correction for multiple comparisons.We calculated the P-values of the prediction accuracy comparison using a paired t-test performed on the results of the individual splits in each model.In the Temporal test set, the enrolled patients (22 men, 20 women) had a mean age of 64 years.The predominant primary cancer type was lung cancer (72.1%).The mean intervals from the pre-SRS MRI to the first post-SRS MRI, first post-SRS MRI to the second post-SRS MRI, and second post-SRS MRI to the third post-SRS MRI were 2.9, 2.9, and 3.1 months, respectively.The total mean interval from the pre-SRS MRI to the third post-SRS MRI was 8.9 months.Among the 62 enrolled target BM nodules, 15 (24.2%) were classified as PD and 47 (75.8%) were assessed as non-PD according to the modified RANO-BM criteria.

Patient characteristics
In both datasets, all patients underwent pre-SRS MRI using a 1.5-T MR scanner.The subsequent three MRI scans were chosen randomly from either a 1.5-T or 3-T MR scanner.The proportion of 1.5-T scans was consistent across both datasets, with a ratio of 0.55 (424/776 in the Developmental dataset and 94/172 in the Temporal test set).

Performance comparison between the models
At the preliminary model selection level in the DL arm, the AUC of the Conv-GRU was superior to that of the simple CNN in 2D (0.8782 versus 0.8344; P < 0.001) and 3D (0.8311 versus 0.7918; P = 0.007) (Supplementary Tables 1 and 2).The results of ablation study for substituting CNN and GRU components with alternative architectures in the 2D Conv-GRU model are presented in Supplementary Table 3.For the Developmental dataset, the mean AUCs from the 10 distinct dataset splits were 0.8782, 0.8311, 0.8228, and 0.7483 for 2D Conv-GRU, 3D Conv-GRU, Dmax, and radiomics, respectively (Table 2).For the Temporal test set, the mean AUCs were 0.8341, 0.7836, 0.7516, and 0.7779 for 2D Conv-GRU, 3D Conv-GRU, Dmax, and radiomics, respectively (Supplementary Table 4).For the Developmental dataset, the mean AUC of the 2D Conv-GRU model was significantly higher than that of the 3D Conv-GRU, Dmax, and radiomics model (P = 0.0028, P < 0.0001, and P < 0.0001, respectively).The mean AUC of the 3D Conv-GRU model was significantly higher than that of the radiomics model (P = 0.0003).Finally, the mean AUC of the radiomics model was inferior to that of the Dmax model (P = 0.0015).For the Temporal test set, the mean AUC of the 2D Conv-GRU model was significantly higher than that of the 3D Conv-GRU, Dmax, and radiomics model (P = 0.0005, P < 0.0001, and P = 0.0002, respectively), similar to the finding of the Developmental dataset.The mean AUC of the radiomics model was also inferior to that of the Dmax model (P = 0.0086) (Table 3).In the representative case, the DL model accurately predicted the PD and non-PD cases, despite the temporal changes in solidity and diameter (Fig. 3).In cases where predictions were accurate, the model consistently concentrated on the enhancing BM nodule across all four MRI scans.Conversely, in cases of incorrect predictions, the model often shifted its attention away from the BM nodule.Additionally, viable tumour regions tended to show stronger activation, while areas of post-treatment change showed weaker activation.

Model performance comparison among the follow-up periods
For the Developmental dataset, the AUC pattern of the 2D Conv-GRU model displayed a gradual increment corresponding to the follow-up periods (AUC of 0.6715, 0.6777, 0.777, and 0.878; only pre-SRS MRI, plus 1, 2, and 3 post-SRS MRI(s), respectively).The AUC from the pre-SRS to the third post-SRS follow-up was significantly higher than that of the remaining periods (P < 0.0001).Additionally, the AUC from the pre-SRS to the second post-SRS follow-up was significantly higher than that from the pre-SRS only or from the pre-SRS to the first post-SRS follow-up (P < 0.0001).For the Temporal test set, the AUC of the 2D Conv-GRU model also improved incrementally with the addition of follow-up MRI scans (AUC of 0.5945, 0.6190, 0.7810, and 0.8341; only pre-SRS Table 2. Predictive accuracy of models for assessing treatment response after stereotactic radiosurgery of brain metastasis.AUC area under the receiver-operating characteristic curves, Conv-GRU convolutional neural network with a gated recurrent unit, Dmax prediction model based on maximum axial diameter, SD standard deviation, Sens sensitivity, Spec specificity, 2D two-dimensional, 3D three-dimensional.www.nature.com/scientificreports/MRI, plus 1, 2, and 3 post-SRS MRI(s), respectively).Likewise, the utilisation of all four MRI scans resulted in a significantly higher AUC compared to analyses with fewer scans (P < 0.0001) (Fig. 4, Table 4).

Discussion
This study used longitudinal MRI data to demonstrate the prediction performance for the treatment response after SRS of BM of the DL (2D versus 3D), radiomics, and Dmax models.The 2D Conv-GRU model displayed superior performance relative to that of the 3D Conv-GRU, radiomics, and Dmax models.Moreover, upon evaluating the 2D Conv-GRU model with varying follow-up periods, the prediction accuracy tended to increase with the number of follow-up MRIs.
Clinicians should consider the possibility of tumour progression and radiation necrosis upon observing an initial increase in tumour size or new contrast-enhancing lesions in the treated area after SRS.Despite their vastly different long-term outcomes, it can be challenging to distinguish between the two conditions in the early post-SRS period using conventional MRI 6 .This aspect is primarily attributed to early tumour size changes after SRS that do not always correlate with the long-term response.Several factors, including genetics, age, performance status, radiation dose or regimen, tumour number or size, and histopathology, may contribute to confusion while assessing the treatment response 5,27,28 , thereby delaying confirmative assessment and timely treatment 8 .Whereas advanced MRI techniques, such as diffusion-weighted imaging, perfusion-weighted imaging, and spectroscopy, as well as positron emission tomography, have been evaluated to supplement conventional MRI, they have not yet demonstrated promising results [29][30][31][32] .As such, the RANO-BM working group recommends a multidisciplinary team decision-making process to assess the treatment response instead of relying on a single modality 6 .
A recent systematic review and meta-analysis suggested that the performance of AI-assisted MRI in classifying tumour progression and radiation necrosis after radiotherapy of BM is inadequate for clinical use 18 .The authors identified several issues, such as the need for extensive DL research, consecutive data recruitment that reflects real-world clinical settings, larger sample sizes for robustness, and research using MRI data from multiple time points.Only a limited number of studies have been published on this topic, and the reported performance remains insufficient.Specifically, one study demonstrated AUCs of 0.72 for DL alone and 0.80 for combined DL and radiomics models 17 , highlighting the need for further improvement.Additionally, BM is the most common brain malignancy in adults, and it is relatively easy to obtain a large sample size; therefore, DL research may be a more suitable methodology than radiomics.Multiparametric evaluation is another research trend, which has presented predictive AUCs from 0.71 to 0.86 15,16 .These researchers co-registered multiple MRI sequences into a single template to combine the information, thus enhancing predictive accuracy.However, they typically use single time point MRI data, which are not representative of daily clinical practice.
In addition, few studies have investigated the use of longitudinal MRI analysis to assess the treatment response of BM 33,34 .This phenomenon is primarily attributed to the difficulty in obtaining longitudinal datasets for BM because the size of the dataset is multiplied by the length of the follow-up period.Nevertheless, the treatment response is assessed based on the serial follow-up MRI scans; accordingly, the model should use data from multiple time points for accurate prediction, rather than relying on that from a single time point.Cho et al. 33 developed and validated a DL model to assess automated treatment response using the RANO-BM criteria; however, the model was designed to provide the current treatment response rather than to predict the future treatment response.Lee et al. 34 conducted a tumour habitat analysis using longitudinal MRI data to predict tumour recurrence after SRS.Using a k-means clustering algorithm, they classified each tumour tissue on physiologic MR images (composed of apparent diffusion coefficient and cerebral blood volume images) into nonviable tissue, hypovascular cellular, and hypervascular cellular habitats.Based on the differences between the first and second follow-up MRI scans, an increase in the hypovascular cellular habitat was the most strongly associated with tumour recurrence.
In this novel study, we applied DL models to analyse longitudinal MRI data from more than two time points to predict the BM treatment response.The 2D Conv-GRU model outperformed the radiomics and Dmax models using four-point sequential MRI data from both the Developmental dataset and Temporal test set.This result suggests that CNN encoders can extract more comprehensive information from MRI than handcrafted feature extraction methods, such as Dmax and radiomics, can.In other words, DL models can automatically extract the most relevant features from MRI scans for treatment response prediction.Moreover, the GRU-based decoder in our DL model, which sequentially acquires multiple inputs, effectively handles sequential data, leading to superior results in the longitudinal MRI analysis.The accuracy of the DL model increased gradually as the number of follow-up studies increased, highlighting the importance of longitudinal assessments.However, the trend of the performance increment did not reach a plateau even when using all four time points (Fig. 4).Therefore, extending the observation period beyond four time points may further improve the prediction accuracy, which warrants further investigation.
In this study, we used a modified version of the RANO-BM criteria, which permitted the consideration of BM nodules as small as 5 mm as measurable lesions, which was suggested by the RANO-BM working group 6 .Advances in MRI hardware have facilitated using thin section images (≤ 1.5 mm) for BM evaluation.This modification increases the number of measurable lesions, potentially resulting in greater reliability of the treatment response assessment.Previous computer-aided detection studies using MRI data reported a mean maximum diameter of < 1 cm (5-9 mm) of the BM nodules [35][36][37][38][39] .Hence, adopting a size threshold of 5 mm is reasonable.
This study had some limitations.First, while we conducted an external validation with temporally separated data, we did not utilise data from other institutions.In addition, the relatively small sample size for model training may not sufficiently capture the temporal dynamics of the data.Consequently, we plan to conduct a follow-up multicentre study to evaluate the generalisability of our model.Second, the ground truth was based principally on clinical and radiological information, with only a few cases confirmed by histopathological evaluation.Despite being a common limitation in similar retrospective studies, it may have affected the accuracy of our results.The retrospective design of our study also may have introduced selection bias.Third, this study included MRI scans obtained from both 1.5-T and 3-T scanners, which introduces potential biases due to the inherent differences in image quality and characteristics.However, we noted an even distribution of patient scans across each dataset, which might have neutralised and mitigated the potential biases by a randomisation effect.Fourth, the effect of the follow-up interval between the MRI scans on the results cannot be entirely excluded, despite the small standard deviations of the intervals.Finally, the requirement for pre-processing and segmentation poses a significant challenge to its clinical applicability.Streamlining this process through integration with our picture archiving and communication system could offer substantial benefits.
In conclusion, using longitudinal MRI data, the 2D Conv-GRU model outperformed the 3D Conv-GRU, radiomics, and Dmax models in predicting the treatment response after SRS of BM.Our results suggest that using three post-SRS MRI examinations can achieve the best performance.

Figure 2 .
Figure 2. Model architecture of the proposed 2D convolutional neural network with a gated recurrent unit (Conv-GRU) model.2D, two-dimensional, Conv-GRU convolutional neural network with a gated recurrent unit.

Figure 3 .
Figure 3. Representative cases that were correctly predicted (A,B) and incorrectly predicted (C,D) by the 2D Conv-GRU model.Each image consists of four sequential MRI scans, their corresponding Eigen-CAM images, and the most recent follow-up MRI scans used for establishing ground truth.(A) A PD case histopathologically confirmed via surgery 18 months post-SRS exhibited no significant change in the size between the second and third post-SRS follow-up MRIs.However, the corresponding Eigen-CAM image revealed stronger activation.The 2D Conv-GRU model correctly predicted PD. (B) A non-PD case showed stable disease on 12 months post-SRS.While the size remained stable on the third post-SRS follow-up MRI, the corresponding Eigen-CAM image displayed weaker activation than the second post-SRS follow-up MRI.The 2D Conv-GRU model accurately predicted non-PD.(C) A PD case, confirmed through surgery 17 months post-SRS, showed a decrease in size by the third post-SRS follow-up MRI.The corresponding Eigen-CAM image exhibited weaker activation, leading the 2D Conv-GRU model to mispredict non-PD.(D) A non-PD case remained stable up to 80 months post-SRS.The Eigen-CAM image for the second post-SRS follow-up MRI lost focus on the tumour, resulting in an incorrect PD prediction by the 2D Conv-GRU model.2D two-dimensional, Conv-GRU convolutional neural network with a gated recurrent unit, MRI magnetic resonance imaging, CAM class activation map, PD progressive disease, SRS stereotactic radiosurgery.

Figure 4 .
Figure 4.The area under the receiver-operating characteristic curves of the 2D Conv-GRU model with varying number of follow-up MRI scans.*Indicates a statistically significant difference.2D two-dimensional, Conv-GRU convolutional neural network with a gated recurrent unit, MRI magnetic resonance imaging.

Table 1 .
Table1summarises the characteristics of the enrolled patients, primary cancer types, follow-up intervals between MRI examinations, maximum axial diameter of the BM nodules, and treatment responses as the ground truth.In the Developmental dataset, the enrolled patients (103 men, 91 women) had a mean age of 64.8 years.The mean number of BM nodules was two per patient.The predominant primary cancer types were lung cancer (74.2%), breast cancer (28%), kidney cancer (3.6%), colon cancer (3.6%), and ovarian cancer (1.5%).Furthermore, the bladder, oesophagus, pancreas, peritoneum, and ureter were primary cancer origin sites in only one case each.The mean intervals from the pre-SRS MRI to the first post-SRS MRI, first post-SRS MRI to the second post-SRS MRI, and second post-SRS MRI to the third post-SRS MRI were 2.6, 2.8, and 2.9 months, respectively.The total mean interval from the pre-SRS MRI to the third post-SRS MRI was 8.3 months.Among the 369 enrolled target BM nodules, 88 (23.8%) were classified as PD and 281 (76.2%) were assessed as non-PD, which consisted of Demographic 140 (37.9%)complete responses, 103 (27.9%) partial responses, and 38 (10.3%) stable disease according to the modified RANO-BM criteria.

Table 3 .
P-values of comparison between predictive accuracies of models for assessing treatment response after stereotactic radiosurgery of brain metastasis.*P-values less than 0.0125 indicated a statistically significant difference after Bonferroni correction for 4-axis multiple comparison.

Table 4 .
P-values of comparison with varying number of follow-up MRIs using 2D ConvGRU model for assessing treatment response after stereotactic radiosurgery of brain metastasis.* P-value less than 0.0125 indicated a statistically significant difference after Bonferroni correction for 4-axis multiple comparison.